“328.77 million terabytes per day”
**90% of data was created in the last 2 years** https://explodingtopics.com/blog/data-generated-per-day
Do you have big data? Really?
It might look big, but you may not have enough rows to provide sufficient variance given the number of variables. Consider 20 Boolean variables – how many rows would you need to represent the possible number of permutations?
\[2^{20} = 1,048,576\]
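A quick sanity check of that arithmetic in Python:

```python
# Each Boolean variable doubles the number of possible rows,
# so n Boolean variables have 2**n distinct combinations.
n_booleans = 20
combinations = 2 ** n_booleans
print(combinations)  # 1048576
```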
Depends: could be based on the number of variables/features, number of outputs/classes…
1-in-10 rule… for regression?
1,000 images for each class… for computer vision?
https://machinelearningmastery.com/much-training-data-required-machine-learning/
https://towardsdatascience.com/how-do-you-know-you-have-enough-training-data-ad9b1fd679ee
https://petewarden.com/2017/12/14/how-many-images-do-you-need-to-train-a-neural-network/
The “one in ten rule” in statistics is a simple guideline for regression analysis that helps prevent overfitting. It states:
For every predictive variable (feature) you want to include in your model, you should have at least 10 events in your data.
For example: with 5 predictor variables, you would want at least 50 events in your dataset.
If you violate this rule by including too many variables, your model may fit your current data well but will likely perform poorly on new data because it’s capturing random noise rather than true relationships.
This rule is often violated in fields like genomics where researchers might analyze thousands of genes with relatively few patients, leading to questionable findings.
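As a rough illustration of the rule of thumb (a hypothetical helper, not a substitute for a proper sample-size calculation):

```python
def max_predictors(events, events_per_predictor=10):
    """One-in-ten rule of thumb: allow roughly one predictor
    variable per ten events (outcome cases) in the data."""
    return events // events_per_predictor

print(max_predictors(120))  # 12 predictors at most
print(max_predictors(35))   # 3 predictors at most
```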
“the future of big data is small data”
Kokol, P., Kokol, M. and Zagoranski, S., 2022. Machine learning on small size samples: A synthetic knowledge synthesis. Science Progress, 105(1), p.00368504211029777.
What is data? (What are data? Datum is the singular)
What is data analytics?
Why is data analytics important?
By Winchester City Council Museums - Flickr, CC BY-SA 2.0, https://commons.wikimedia.org/w/index.php?curid=39255824 “Medieval tally stick” (front and reverse view). The stick is notched and inscribed to record a debt owed to the rural dean of Preston Candover, Hampshire, of a tithe of 20d each on 32 sheep, amounting to a total sum of £2 13s. 4d.
| User | Variable 1 | Variable 2 | Variable 3 | Variable 4 |
|---|---|---|---|---|
| User1 | 304 | 200 | 438 | 637 |
| User2 | 501 | 302 | 384 | 278 |
| User3 | 503 | 627 | 893 | 923 |
Instances / rows
Columns/variables are also typically called features or attributes…
Images – in a greyscale image, one pixel = 1 byte (values 0–255)
ECG / time series – an array of numbers/amplitudes
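To make those representations concrete, a minimal sketch with illustrative values:

```python
# A 2x3 greyscale image: one byte (0-255) per pixel.
image = [
    [0, 128, 255],
    [34, 200, 90],
]

# An ECG trace: a 1-D array of sampled amplitudes.
ecg = [0.0, 0.12, 0.48, 1.05, 0.40, -0.15, 0.02]

print(len(image), len(image[0]))  # rows, columns: 2 3
print(max(ecg))                   # peak amplitude: 1.05
```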
Ptrump16, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons
DIKW Pyramid
Why is data analytics important?
Credit “Christoph Roser at AllAboutLean.com” https://commons.wikimedia.org/wiki/File:Industry_4.0.png
4th: More data is generated by sensors and devices
4th: Data is needed for AI algorithms
Value of data in services – mental health case study
Bond, R.R., Mulvenna, M.D., Potts, C., O’Neill, S., Ennis, E. and Torous, J., 2023. Digital transformation of mental health services. npj Mental Health Research, 2(1), p.13.
Caller Characteristics/features:
4 Steps: Data Preparation for Clustering analysis:
O’Neill, S., Bond, R.R., Grigorash, A., Ramsey, C., Armour, C. and Mulvenna, M.D., 2019. Data analytics of call log data to identify caller behaviour patterns from a mental health and well-being helpline. Health informatics journal, 25(4), pp.1722-1738.
Artificial Intelligence
Computational Intelligence
Intelligent Systems
Data Analytics
Data Mining
Data Science
Data Wrangling
Big Data
Machine Learning
Algorithms
Business Intelligence
Artificial Intelligence (AI): The simulation of human intelligence in machines programmed to think and learn like humans, performing tasks that typically require human intelligence.
Computational Intelligence: A set of nature-inspired computational approaches to address complex problems, including neural networks, fuzzy systems, and evolutionary computation.
Intelligent Systems: Computer systems designed to emulate aspects of human intelligence, capable of sensing, reasoning, learning, and acting autonomously.
Data Analytics: The process of examining datasets to draw conclusions about the information they contain, using specialized systems and software.
Data Mining: The practice of examining large databases to generate new information and identify patterns using statistical methods, machine learning, and database systems.
Data Science: An interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Data Wrangling: The process of cleaning, structuring, and enriching raw data into a desired format for better decision-making and analysis.
Big Data: Extremely large datasets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.
Machine Learning: A subset of AI that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
Algorithms: A step-by-step procedure or formula for solving a problem or accomplishing a task, often used in data processing, calculation, and automated reasoning.
Business Intelligence: Strategies and technologies used by enterprises for data analysis of business information, providing historical, current, and predictive views of business operations.
Statistical modeling, signal processing, feature engineering, statistical computing, statistical learning, pattern recognition, computer vision, business analytics…
Explore various definitions of the following phrases/terms:
Here are brief definitions of each term:
Statistics: The science of collecting, analyzing, interpreting, and presenting data to identify patterns and trends.
Data Science: An interdisciplinary field using scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Data Analytics: The process of examining datasets to draw conclusions about the information they contain using specialized tools and techniques.
Data Mining: The practice of examining large databases to discover patterns and generate new information using automated methods.
Knowledge Discovery: The process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
Machine Learning: A subset of AI enabling systems to learn and improve from experience without explicit programming.
Algorithms: Step-by-step procedures or formulas designed to solve problems or perform specific tasks.
Artificial Intelligence: The development of computer systems able to perform tasks that typically require human intelligence.
80% - 90%?
Wide and long data formats
| User | Page1Views | Page2Views | Page3Views | Page4Views |
|---|---|---|---|---|
| User1 | 304 | 200 | 438 | 637 |
| User2 | 501 | 302 | 384 | 278 |
| User3 | 503 | 627 | 893 | 923 |
| User | Page | n |
|---|---|---|
| User1 | Page1Views | 304 |
| User1 | Page2Views | 200 |
| User1 | Page3Views | 438 |
| User1 | Page4Views | 637 |
| User2 | Page1Views | 501 |
| User2 | Page2Views | 302 |
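The same wide-to-long reshaping can be sketched in plain Python (pandas users would typically reach for `melt`/`pivot` instead):

```python
# Wide format: one row per user, one column per page.
wide = {
    "User1": {"Page1Views": 304, "Page2Views": 200,
              "Page3Views": 438, "Page4Views": 637},
    "User2": {"Page1Views": 501, "Page2Views": 302,
              "Page3Views": 384, "Page4Views": 278},
}

# Long format: one row per (user, page) pair.
long_rows = [
    (user, page, n)
    for user, pages in wide.items()
    for page, n in pages.items()
]

print(long_rows[0])  # ('User1', 'Page1Views', 304)
print(len(long_rows))  # 8 rows: 2 users x 4 pages
```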
Also having open data does not necessarily mean that it is easily accessible data!
“It’s easy to let ourselves be driven by what we can do with the data, rather than by the most pressing clinical need. We see many AI solutions addressing the same tasks, because those are the tasks for which the data are available.”
Bethany Percha, Mount Sinai Health System
Mixed methods?
Many other non-ML topics:
Cockburn A, Dragicevic P, Besançon L, Gutwin C. Threats of a replication crisis in empirical computer science. Communications of the ACM. 2020 Jul 22;63(8):70-9.
- Leading to reproducibility problems
- “70% of researchers have tried and failed to reproduce another scientist’s experiments”
- https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
- 3 types of reproducibility problems (Goodman, 2016):
- methods, results, and inferential reproducibility
Goodman, S.N., Fanelli, D. and Ioannidis, J.P., 2016. What does research reproducibility mean?. Science translational medicine, 8(341), pp.341ps12-341ps12.
Data Science Checklists
Reproducibility Checklist
https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf
Pre-ML Checklist
https://services.google.com/fh/files/blogs/data-prep-checklist-ml-bd-wp-v2.pdf
Kolodner, J.L., 2002. The “neat” and the “scruffy” in promoting learning from analogy: We need to pay attention to both. The Journal of the Learning Sciences, 11(1), pp.139-152.
Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P., 1996. From data mining to knowledge discovery in databases. AI Magazine, 17(3), pp.37-54.
Workflow:
https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
80% : 20% ?
Akin to Pareto’s law
https://medium.com/analytics-vidhya/a-data-cleaning-journey-2b0146407e44#
Data wrangling example scenarios
Male
m
Female
MALE
M
F
f
male
FEMALE
fem
Mal
…
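Cleaning such a column usually means mapping the messy free-text variants onto canonical labels. A minimal sketch (illustrative mapping only; real data may need more categories and a review of unmapped values):

```python
def normalise_gender(raw):
    """Map messy free-text gender values onto canonical labels."""
    value = raw.strip().lower()
    if value in {"male", "m", "mal"}:
        return "Male"
    if value in {"female", "f", "fem"}:
        return "Female"
    return "Unknown"  # flag for manual inspection

print(normalise_gender("MALE"))  # Male
print(normalise_gender("fem"))   # Female
```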
SPSS vs. R
R vs. Python
“Functional Programming is when functions …are used as the fundamental building blocks of a program.”
c2.com/cgi/wiki?FunctionalProgramming
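A minimal Python illustration of the idea, with functions passed to other functions as building blocks:

```python
from functools import reduce

# Compose behaviour by passing functions around
# rather than mutating shared state.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
evens = list(filter(lambda x: x % 2 == 0, squares))
total = reduce(lambda a, b: a + b, evens)

print(squares)  # [1, 4, 9, 16]
print(evens)    # [4, 16]
print(total)    # 20
```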
CSV file with 1,500,000 rows and 120 columns, each cell being a number.
https://en.wikipedia.org/wiki/Double-precision_floating-point_format
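A back-of-the-envelope estimate of what that CSV needs once loaded, assuming each cell is held as a double-precision (8-byte) float:

```python
rows, cols = 1_500_000, 120
bytes_per_double = 8  # IEEE 754 double precision

in_memory = rows * cols * bytes_per_double
print(in_memory)                      # 1440000000 bytes
print(round(in_memory / 1024**3, 2))  # about 1.34 GiB
```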
vector()
x <- c(0.5, 0.6)      ## numeric
x <- c(TRUE, FALSE)   ## logical
x <- c(T, F)          ## logical
x <- c("a", "b", "c") ## character
x <- c(1L, 2L, 3L)    ## integer (plain c(1, 2, 3) would be numeric)
x <- c(1+0i, 2+4i)    ## complex
Note: R vectors are 1-based, not 0-based.